Biodiversity Data Journal 13: e141562 
doi: 10.3897/BDJ.13.e141562 


Commentary on "Preliminary Species Hypotheses" 
in Entomological Taxonomy: A Global Data and 
FAIR Infrastructure Perspective 


Sharif Islam ‡ § 


‡ Naturalis Biodiversity Center, Leiden, Netherlands 
§ DiSSCo, Leiden, Netherlands 


Corresponding author: Sharif Islam (sharif.islam@naturalis.nl) 

Academic editor: Lyubomir Penev 

Received: 11 Nov 2024 | Accepted: 21 Jan 2025 | Published: 10 Feb 2025 

Citation: Islam S (2025) Commentary on "Preliminary Species Hypotheses" in Entomological Taxonomy: A 
Global Data and FAIR Infrastructure Perspective. Biodiversity Data Journal 13: e141562. 

https://doi.org/10.3897/BDJ.13.e141562 


Abstract 


What if early taxonomic findings were treated like preprints, open to iterative improvement 
or managed with practices from the open-source community, such as Git branching, 
merging and patch management? Prompted by Buckley's article Charting a Future for 
Entomological Taxonomy in New Zealand (2024), this commentary explores these 
possibilities in the context of biodiversity informatics. In response to the need for rapid, 
scalable biodiversity monitoring, Buckley introduces preliminary species hypotheses 
(PSH) as a bridge between quick identification tools and the rigorous Linnaean system, 
leveraging DNA barcoding and AI-assisted image recognition to produce provisional 
classifications that can later be validated. Expanding on Buckley’s framework, this 
commentary emphasises the critical role of data linking, versioning and integration to 
support evolving taxonomic data. Borrowing from software and open-source practices, I 
explore the idea of managing PSH with an infrastructure that treats each taxonomic 
update as a versioned "commit", which can be tracked, refined and integrated over time. 
Drawing insights from FAIR (Findable, Accessible, Interoperable, Reusable) principles 
and Digital Extended Specimens, I identify infrastructure requirements for PSH, including 
robust data standards, persistent identifiers and interoperability to support global 
biodiversity repositories. Additionally, Taxonomic Data Objects offer a model for 
dynamically integrating PSH into adaptable taxonomies that can evolve with new data 
and tools. By positioning PSH within an open, infrastructure-focused framework, this 
commentary advocates for scalable, hypothesis-driven biodiversity data that meets 
modern conservation needs, bridging traditional and emerging practices in taxonomy. 


© Islam S. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), 
which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are 
credited. 


Keywords 


taxonomy, species, interoperability, FAIR, data integration, open source 


Introduction 


In Charting a Future for Entomological Taxonomy in New Zealand, published in the 
journal New Zealand Entomologist, T.R. Buckley (2024) proposes the concept of 
preliminary species hypotheses (PSH) as a way to bridge the gap between the need for 
rapid species identification and the rigorous Linnaean taxonomy. Buckley argues that 
PSH can address biodiversity monitoring needs by utilising the output of rapid identification 
tools - such as DNA barcoding and AI-assisted image recognition - as provisional 
classifications that serve as an intermediate stage before formal taxonomic classification. 
Although based in New Zealand and focused on entomology, this proposal has 
implications for other regions and fields within taxonomy and biodiversity research. 


Buckley's proposal envisions scalable, hypothesis-driven biodiversity data that can 
evolve as new information emerges. Inspired by this approach, we might ask: What if 
early taxonomic findings were treated like preprints (Verma and Detsky 2020) - open to 
iterative improvement? Or managed through practices adapted from open-source 
software development, such as Git branching, merging and patch management, where 
each PSH acts as a versioned "commit"? While this iterative process echoes how 
taxonomic science has traditionally progressed, this approach could offer a flexible 
framework for tracking and refining taxonomic data over time. 


In this commentary, I explore the data linking and integration infrastructure required to 
support Buckley’s vision, emphasising how PSH fits within the broader framework of the 
Species Hypotheses (SH) and Taxon Hypotheses (TH) paradigm. These concepts also 
align with evolving Biodiversity Information Standards (TDWG) standards, such as the Taxon 
Concept Schema (TCS), which separates Taxon Concepts from Taxon Names to 
enhance data interoperability (Klazenga and Liljeblad 2024). Building on these conceptual 
frameworks, I draw on knowledge infrastructure studies, such as Christine 
Borgman's work on data systems (Borgman and Brand 2024) and Sterner et al.'s 
pluralistic framework for biodiversity data sharing (Sterner et al. 2023). I also consider 
recent proposals, such as Digital Extended Specimens (Hardisty et al. 2022) and 
Taxonomic Data Objects (Upham and Poelen 2024), as potential models for integrating 
PSH and other hypothesis-driven insights (from both molecular and media-based 
multimodal workflows) as data products within global biodiversity infrastructures. This 
infrastructure-based approach can help sustain taxonomy's relevance in conservation 
and research. My focus is not on assessing the scientific rigour of PSH, but rather on the 
data linking and integration strategies that could underpin its implementation, offering a 
scalable pathway for the evolution of taxonomic knowledge. 


Summary of paper 


Buckley’s proposal introduces PSH as a practical and flexible way to address the gap 
between rapid biodiversity identification needs and the more formal Linnaean 
classification system. The paper presents the proposal with a historical background of 
entomological taxonomy in New Zealand, discussing the reasons for declining taxonomy 
funding and the importance of maintaining scientific rigour. For relevance, the summary 
here highlights the key concepts of PSH. 


This provisional approach of PSH aligns well with similar concepts, such as Operational 
Taxonomic Units (OTUs) in DNA barcoding, which serve as proxies to categorise 
unidentified taxa for integration across different biodiversity datasets and use cases. 
According to Buckley, OTUs are typically molecular-based groupings (often not derived 
from DNA-sequenced specimens, particularly for environmental DNA data). As Buckley 
(2024: 9) states: 


“...it is difficult to reconcile these OTUs [OTUs that are not derived from DNA 
sequencing specimens and do not have a physical reference specimen] with 
other types of character data. From a hypothesis testing perspective, these OTUs 
can also be considered ‘preliminary species hypotheses’, but with a weaker 
degree of support than from specimen-based DNA sequencing approaches (as 
outlined earlier). This approach will require a large-scale eDNA survey of New 
Zealand, focusing on the sampling of soil, water, air and insect trap residues. 
Achieving this goal would also be a 5- to 10-year project with a moderate financial 
investment. The output would be a comprehensive database of OTUs that, over 
time, could be connected to described species or to DNA sequences obtained 
from individual specimens”. 


In contrast, PSH are structured as an intermediate classification that is less formal than 
Linnaean taxonomy, but aspires to achieve it over time. Unlike OTUs, PSH are not simply 
molecular clusters; they are hypotheses that can later be validated and incorporated into 
formal taxonomy as additional data become available. Buckley also reminds us in the 
paper that similar methods are commonly used in the fields of mycology (Kõljalg et al. 2013) 
and bacteriology. While OTUs offer a rapid and flexible tool for biodiversity estimation, 
PSH are designed to be a step closer to formal species recognition, enabling hypothesis- 
driven research and prioritisation without bypassing rigorous taxonomic standards 
entirely: 


“The goal is not to replace the Linnean system, or to lower its scientific 
robustness, but to provide a framework for describing biodiversity more quickly 
than Linnean taxonomy can. DNA data can characterise lineages that, in turn, can 
be considered as ‘preliminary species hypotheses’. These hypothesised species 
can be tested, verified and described by taxonomists later if resources become 
available. In the meantime, these hypothesised species can be used as a basis in 
downstream conservation actions or ecological studies that require biodiversity to 
be divided into scientifically meaningful entities. However, it must be remembered 
that these hypothesised species have not been subject to robust testing and, 
therefore, any downstream inference will not be as reliable as that from a fully 
revised taxon” (Buckley 2024:8). 


Furthermore, the robustness and benefit of the hypothesis-driven and iterative approach 
come not just from a single data type, but from integrating a variety of data types. For 
instance, combining molecular-based methods and multi-modal AI techniques can 
significantly reduce uncertainties in the inference of observations: 


"If we want robust species hypotheses, then large numbers of characters will 
continue to be needed. There are technologies emerging that promise to greatly 
increase the rate of data collection without sacrificing scientific robustness. The 
approach adopting these technologies is known as large-scale integrative 
taxonomy (Hartop et al. 2022; Salili-James et al. 2023; Karbstein et al. 2024). 
Briefly, this approach comprises two steps. First, high throughput methods are 
used to collect character data and perform a provisional grouping of specimens 
into putative species. Second, another character type, with a high a priori 
probability of being incongruent with the first character set, is used to test those 
putative species (Hartop et al. 2022). A key feature is the use of technology to 
accelerate the rate and scale of data collection" (Buckley 2024:7). 


The demand for taxonomic information for a variety of use cases (such as environmental 
monitoring and biosecurity) is rising, making traditional insect sampling and identification 
methods increasingly impractical, especially amidst a shortage of experts. New 
technologies, including DNA barcoding, eDNA for community assessments and 
automated image recognition, offer promising alternatives that can democratise species 
identification. Automated image recognition, in particular, enables non-specialists to 
identify insects, making taxonomy more accessible. However, according to Buckley, 
successful adoption of these tools requires extensive digitisation of specimen records 
and integration with images, DNA sequences and geo-referenced data. 


Key Terms, Definitions and Alignment with Existing Concepts 


The practice of taxonomy and nomenclature deals with different concepts and terms 
beyond naming species (see Favret (2024) for the 5 'D's of taxonomy: delimitation, diagnosis, 
description, determination and discovery), and the aspect of testable hypotheses 
intersects all of these concepts. While detailing every aspect is beyond the scope of this 
commentary, this section defines key terms and situates them within the evolving 
landscape of biodiversity informatics. The following concepts are briefly stated here to 
facilitate the discussion and lay the foundation for understanding how PSH can integrate 
into taxonomic workflows and biodiversity data infrastructures. 


Barcode Index Numbers (BINs): BINs are molecular-based clusters derived from DNA 
barcoding, primarily serving as proxies for species identification using genetic 
divergence thresholds. BINs are similar to OTUs, but are specific to DNA barcoding. 
Unlike OTUs, which are often used as an intermediate step requiring further species-level 
identification, BINs are dynamic: the boundaries of what sequences can be 
associated with a particular BIN can change with new sampling data (Ratnasingham and 
Hebert 2013; Lue et al. 2022). In addition, one BIN can cover more than one taxon (Huemer and 
Mutanen 2022). 


Species Hypotheses (SH): SH is the main building block of UNITE (a database and 
sequence management environment centred on the eukaryotic nuclear ribosomal ITS 
region) which groups similar sequences into provisional species-level clusters typically 
comprising two or more sequences to avoid excessive inflation (Kõljalg et al. 2013). 
Representative sequences for each SH are chosen through consensus computation or 
expert designation. These SHs, along with their representative sequences and 
annotations, are made available as reference datasets. Buckley's paper discusses SH 
used in mycology and explores how entomology can adapt similar ideas. This discussion 
also opens up the possibility of integrating SH concepts for broader use beyond 
mycology and zoology, not necessarily limited to DNA-based identification methods. 
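
To make the clustering idea behind SH more concrete, the following is a toy sketch only and
not UNITE's actual pipeline. It assumes pre-aligned sequences of equal length, a hypothetical
90% identity threshold, single-linkage grouping and a medoid as the representative sequence;
all function names and data are illustrative.

```python
from __future__ import annotations
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Share of matching positions in two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(seqs: dict[str, str], threshold: float) -> list[list[str]]:
    """Single-linkage clustering: link any pair at or above the threshold."""
    parent = {name: name for name in seqs}

    def find(name: str) -> str:
        # Follow parent links up to the root of the group.
        while parent[name] != name:
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if identity(seqs[a], seqs[b]) >= threshold:
            parent[find(a)] = find(b)

    groups: dict[str, list[str]] = {}
    for name in seqs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


def representative(members: list[str], seqs: dict[str, str]) -> str:
    """Medoid: the member sharing the most positions with all members."""
    def matches(m: str) -> int:
        return sum(x == y for o in members for x, y in zip(seqs[m], seqs[o]))
    return max(sorted(members), key=matches)


# Hypothetical toy sequences, already aligned.
SEQS = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAA",
    "s3": "ACGTACGTAC",
    "s4": "TTTTGGGGCC",
}

clusters = cluster(SEQS, threshold=0.9)
# Keep only clusters with two or more sequences, mirroring the rule described above.
hypotheses = [c for c in clusters if len(c) >= 2]
representatives = {tuple(c): representative(c, SEQS) for c in hypotheses}
```

In this toy run the singleton `s4` is not promoted to a hypothesis. In practice, the threshold,
the alignment method and the choice between consensus computation and expert designation
would all be decisions for the curating community.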


Taxon Hypotheses (TH) paradigm: Expanding on the SH concept, Kõljalg et al. (2020) 
introduce the TH paradigm, a framework for linking sequence-based 
identifications to taxonomic concepts. By assigning Digital Object Identifiers (DOIs) to 
these hypotheses, THs enable transparent and reproducible connections between 
molecular data and taxonomic classifications. Kõljalg et al. (2020) also highlight that, 
while molecular data are becoming increasingly common, differences in sampling, 
genetic markers and analytical methods often lead to competing and sometimes 
conflicting classifications. The reference datasets and DOIs provided by UNITE offer a 
unique reference point that remains consistent even as underlying data and conclusions 
evolve. This system allows users to reference the data, enabling modifications and 
augmentations while preserving original versions. 


All of these frameworks have one thing in common: they acknowledge the dynamic and 
"preliminary" nature of initial insights into species identification. Thus, PSH or SH could 
emerge as a new "data type" that can be used not just in mycology or zoology, but across 
domains. Furthermore, this approach supports integrative methods that apply multiple 
types of characters, leading to robust hypothesis tests and, therefore, greater confidence 
in the acceptance or rejection of a species hypothesis. 


Recent discussions on species delimitation and AI (see Karbstein et al. (2024)) also 
underscore the importance of incorporating multiple data types and frameworks, such as the 
unified species concept, morphological and phylogenetic (genetic relationships and 
shared ancestry) methods and DNA clustering methods, which are moving towards a more integrative 
approach (genetics/genomics + morphology + ecology). AI-based identification methods, 
including multimodal approaches involving sound and vision, are also becoming 
increasingly prevalent (Wäldchen and Mäder 2018; Yang et al. 2021). Each approach 
has limitations; thus, integrative approaches that combine multiple lines of evidence align 
with the dynamic nature of species hypotheses. 


By situating PSH, SH, TH, BINs and OTUs within a unified conceptual framework, this 
commentary underscores the value of treating species hypotheses as dynamic, evolving 
data objects. Each concept - BINs, OTUs, SH, TH and PSH - has distinct origins rooted in 
specific fields, such as molecular biology, fungal taxonomy and entomology. These 
approaches complement the Linnaean classification by integrating preliminary taxonomic 
data into an iterative process that refines and validates hypotheses over time. Expanding 
their application to encompass diverse data types will enhance their utility across 
taxonomic domains. A holistic and integrative approach supports the iterative refinement 
of taxonomies while balancing the need for rapid discovery with the production of robust, 
high-quality data. 


The role of infrastructures 


Following the summary of Buckley’s PSH proposal, it becomes clear that data integration 
and linking will be an important aspect and, thus, the successful implementation and 
sustainability of PSH require a robust digital infrastructure. This infrastructure not only 
enables data sharing, but also supports the evolution of taxonomic knowledge in a 
scalable and accessible way. The PSH model is comparable to preprints in scholarly 
publishing: it provides a way to make new insights accessible, citable and linkable, even 
if they require further refinement and validation. When viewed through the lens of the 
Digital Extended Specimen (DES) paradigm (Hardisty et al. 2022) and the FAIR 
(Findable, Accessible, Interoperable, Reusable) principles, the PSH concept highlights 
the need for infrastructure that can support both provisional classifications and long-term 
taxonomic research. The intersection of PSH with DES and FAIR principles underscores 
the challenges - and critical importance - of establishing, maintaining and scaling digital 
infrastructure to meet the demands of modern biodiversity research. This is not an argument 
for a new type of digital infrastructure, but for improving existing infrastructures and 
aligning the global and regional funding schemes needed to implement such a 
proposal. Similar to Buckley, Meier et al. (2024) also emphasise that achieving 
integrative taxonomy (combining morphological, whole organism study with molecular 
data) requires reliable data handling, including efficient voucher storage, standardised 
data practices and FAIR-compliant infrastructure to support the evolution of taxonomic 
hypotheses as new data are added. 


For biodiversity data to be effective, including taxonomic and nomenclature information, a 
resilient infrastructure is crucial to maintain links amongst evolving species hypotheses, 
underlying specimens, environmental observations and genetic data. Efforts to create 
such infrastructures have accelerated globally as we confront biodiversity and climate 
crises (Devictor and Bensaude-Vincent 2016). Although global data infrastructures that 
support biodiversity data and research funding are unevenly distributed, the DES and 
PSH approach could mitigate disparities by providing an inclusive, interoperable system 
that enables biodiversity data sharing across regions and disciplines. 


The DES, as proposed, is a paradigm for digitally linking specimen data from global 
natural science collections to related taxonomic, ecological and environmental data. DES 
enables the transformation of physical specimen data into digital objects, making them 
accessible and FAIR. This approach not only broadens usability, but also enhances the 
value of collections by integrating them into global data infrastructures that can be 
leveraged for large-scale, multifactor analysis (Heberling et al. 2021). Thinking about 
DES, PSH and FAIR in a holistic framework brings up the notion of pluralistic data 
pooling advocated by Sterner et al. (2023: 2): 


We define ‘data pooling’ for biodiversity data as a process that combines data 
from multiple sources into one taxonomically standardized body of information, 
provides infrastructure for managing and accessing the combined data and 
governs it as a shared resource for a community of users and stakeholders 
beyond a single research project or lab. We define ‘taxonomic standardization’ as 
a set of processes for verifying and re-identifying a collection of species 
observations as needed to ensure that they are classified in a standardized way 
according to a single, coherent taxonomy of choice. More generally, ‘data 
standardization’ (also known as data harmonization) is an established term in 
academic and industry data science practices. 


Part of this set of processes can be a PSH data element that accommodates evolving 
taxonomic concepts, while ensuring reliable links between data sources. It allows for both 
the robustness of Linnaean taxonomy and the flexibility of documenting hypotheses, 
thereby fostering a dynamic approach to biodiversity research. Echoing Sterner et al. (see also 
Leonelli (2020) and Borgman and Wofford (2021)), the challenges of biodiversity data 
collection, sharing and preservation are as much social as technical, thus: 


“...making biodiversity data comprehensively available and reusable will likely 
require major changes to the cultures, organizations and infrastructures of the 
research communities involved” (Sterner et al. 2023: 2). 


This also brings up the notion of maintenance and support. As Borgman et al. (2016) 
note, "durability" in infrastructure requires continuous maintenance across technical and 
human resources. Applying this insight to biodiversity data infrastructure highlights that 
building a sustainable, FAIR-compliant system requires not only technical innovation, but 
also governance and investment. Borgman’s work in astronomy shows that even well- 
established systems still face fragility without regular support - an important reminder as 
we build infrastructures that will support biodiversity data on a global scale. 
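
Returning to the definition of data pooling and taxonomic standardisation quoted earlier from
Sterner et al. (2023), the following minimal sketch shows how a PSH label might sit alongside
formal names when records from several sources are combined. It is an illustration under
stated assumptions: the lookup table, the placeholder names ("Aus bus", "PSH-0001"), the
source names and the field names are all hypothetical and do not come from any existing
service or standard.

```python
from __future__ import annotations

# Hypothetical "taxonomy of choice": maps names as recorded to one standard name.
# A provisional hypothesis (PSH-0001) is kept as its own unit until it is revised.
TAXONOMY_OF_CHOICE = {
    "Aus bus": "Aus bus",
    "Aus cus": "Aus bus",      # treated as a synonym in the chosen taxonomy
    "PSH-0001": "PSH-0001",
}


def pool(sources: dict[str, list[dict]]) -> tuple[list[dict], list[dict]]:
    """Combine records from several sources and standardise their names.

    Records whose name is not in the chosen taxonomy are returned separately
    so they can be re-identified rather than silently dropped.
    """
    pooled, unresolved = [], []
    for source, records in sources.items():
        for record in records:
            standard = TAXONOMY_OF_CHOICE.get(record["name"])
            row = {
                "source": source,                 # provenance of the record
                "verbatim_name": record["name"],  # name exactly as recorded
                "standard_name": standard,        # name in the chosen taxonomy
                "site": record["site"],
            }
            (pooled if standard is not None else unresolved).append(row)
    return pooled, unresolved


# Hypothetical input from two sources.
SOURCES = {
    "museum": [
        {"name": "Aus cus", "site": "A"},
        {"name": "Aus bus", "site": "B"},
    ],
    "edna_survey": [
        {"name": "PSH-0001", "site": "A"},
        {"name": "Xus yus", "site": "C"},
    ],
}

pooled, unresolved = pool(SOURCES)
```

Keeping the verbatim name and the source next to the standardised name is one way to keep the
links between data sources traceable if the chosen taxonomy, or the status of a PSH, later
changes.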




Integration with Global Data Standards and Networks 


As mentioned already, PSH can expand beyond New Zealand and entomology; it has 
potential for integration with global biodiversity data initiatives. Organisations and 
platforms such as the Catalogue of Life, GBIF, BCON, ALA, INSDC, BOLD, UNITE and 
DiSSCo provide frameworks, tools and services for aggregating and curating biodiversity 
data, which could be expanded to incorporate PSH as a new type of digital object. By 
embedding provisional species data into the global biodiversity network, PSH could 
become widely accessible and actionable across regions and disciplines. 


As Moersberger et al. (2024) emphasise in their study on European biodiversity 
monitoring, integrating biodiversity data is crucial for reducing fragmentation and filling 
taxonomic gaps. Aligning PSH with the shift toward digital taxonomy could further bridge 
the divide between morphological and molecular approaches, providing traceable, 
reusable links to each hypothesis’s provenance. This would enable a more cohesive and 
adaptable taxonomy, supporting dynamic updates as new data become available. 


Enhancing PSH with FAIR Compliance 


To fully realise PSH, we need infrastructure that is both accessible and FAIR-compliant. 
These hypotheses will function as data points or nodes within a knowledge graph (Page 
2019, Penev et al. 2024) and, because they could be stored across multiple 
infrastructures (Sterner et al. 2020), data linking and interoperability are essential. The 
Upham and Poelen (2024) concept of Taxonomic Data Objects aligns with this need by 
offering machine-readable digital packages that encode metadata, enabling the tracking 
of evolving species concepts over time. Initial taxonomic data can also be compared to a 
software commit in Git: each PSH represents a specific "state" of species classification, 
preserving the evolution of taxonomic understanding without overwriting earlier 
hypotheses. This approach provides a clear pathway for reviewing and merging 
provisional classifications with established taxonomies, strengthening taxonomic 
workflows by ensuring data integrity and interoperability across different taxonomic 
systems (see Fig. 1 for a simple schematic comparing Git merging with the process 
described using PSH). 
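
The commit analogy can be sketched in a few lines of code. The following is a minimal
illustration, not an implementation of Taxonomic Data Objects or of Git itself: it assumes
that each hypothesis state is identified by a SHA-256 hash of its content and its parent
states, so that earlier states are never overwritten. The class names, field names, status
values and evidence labels are hypothetical.

```python
from __future__ import annotations
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class HypothesisCommit:
    """One immutable state of a species hypothesis."""
    label: str
    evidence: tuple[str, ...]
    status: str = "preliminary"
    parents: tuple[str, ...] = ()

    @property
    def commit_id(self) -> str:
        # Content-derived identifier: the same content always gives the same ID.
        payload = json.dumps(
            {"label": self.label, "evidence": self.evidence,
             "status": self.status, "parents": self.parents},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class HypothesisLog:
    """Append-only store of hypothesis states linked to their parents."""

    def __init__(self) -> None:
        self._commits: dict[str, HypothesisCommit] = {}

    def commit(self, label, evidence, status="preliminary", parents=()) -> str:
        entry = HypothesisCommit(label, tuple(sorted(set(evidence))),
                                 status, tuple(parents))
        for parent in entry.parents:
            if parent not in self._commits:
                raise KeyError(f"unknown parent commit: {parent}")
        self._commits.setdefault(entry.commit_id, entry)
        return entry.commit_id

    def merge(self, left: str, right: str, label: str) -> str:
        """Combine two hypotheses into one state that records both parents."""
        evidence = set(self._commits[left].evidence) | set(self._commits[right].evidence)
        return self.commit(label, evidence, parents=(left, right))

    def get(self, commit_id: str) -> HypothesisCommit:
        return self._commits[commit_id]

    def history(self, commit_id: str) -> list[str]:
        """Return the given state and all of its ancestors."""
        seen: list[str] = []
        stack = [commit_id]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.append(current)
                stack.extend(self._commits[current].parents)
        return seen


log = HypothesisLog()
barcode = log.commit("PSH-A", ["barcode:seq-1"])
image = log.commit("PSH-B", ["image:img-7"])
merged = log.merge(barcode, image, label="PSH-AB")
reviewed = log.commit("PSH-AB", [*log.get(merged).evidence, "morphology:desc-3"],
                      status="reviewed", parents=[merged])
```

Because identifiers are derived from content, the earlier preliminary states remain
retrievable through `log.history(reviewed)` after review. A production system would more
likely pair such hashes with registered PIDs than rely on hashes alone.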


Practical Requirements for Preliminary Species Hypotheses 
Implementation 


For PSH to serve as a valuable tool in taxonomy and biodiversity informatics, certain key 
elements are essential. This is an initial proposal and will benefit from further discussion: 


1. Persistent Identifiers (PIDs): Each PSH digital object should be assigned a PID 
to ensure reliable tracking and referencing, similar to the approach used for 
Digital Extended Specimens within the FAIR Digital Object framework (Islam et al. 
2023). As suggested by Upham and Poelen (2024), 
versioning and hashing could be incorporated as part of the metadata to support 
tracking changes over time. Assigning PIDs to taxonomic data and hypotheses is 
not a new concept; for example, the Catalogue of Life assigns identifiers for name 
usage and checklists (Bánki et al. 2023) and UNITE assigns DOIs to species 
hypotheses (Kõljalg et al. 2020). The discussion should not focus on which 
specific PID mechanism is optimal - though implementation details are important - 
but rather on establishing a consensus and actionable plan to assign PIDs to 
these entities at a granular level. This will enable effective tracking and linking, 
but will require dedicated infrastructure and ongoing maintenance support. By 
assigning transparent and persistent identifiers to contributors across all stages of 
a species hypothesis’ evolution, the framework could foster equitable recognition 
while maintaining rigorous standards for formal naming. 


[Figure 1: schematic with two parallel flows, "Git Ideas" and "Taxonomic Workflow". Labels 
recoverable on the Git side: Main Branch; Hypotheses Branch 1; Hypotheses Branch 2; 
Hypotheses Sub-branch 2.1; Combined Hypotheses Branch; FAIR and Data Standards 
Alignment / Taxon Concept Schema Usage; Linked with PIDs; Long Term Data Repository; 
Reviewed & Accepted; Refined Taxonomic Data. Labels recoverable on the taxonomic side: 
Specimen Collection & Morphological Data; Environmental DNA & Multi Modal Data; Initial 
Identification; Species Hypothesis: SH1; Species Hypothesis: SH2; SH Integrative Data; 
Provisional Data Integration; FAIR and Data Standards Compliance / Taxon Concept Schema 
Usage; Linked with PIDs; Long Term Data Repository; Revised SH with Additional Data; 
Reviewed & Accepted; Refined Taxonomic Data.] 


Figure 1. 


A simplified conceptual framework for version-controlled taxonomic data 
management 


This diagram illustrates the parallel between hypothesis-driven taxonomic workflows and Git- 
based version control systems. Drawing inspiration from software development practices, the 
framework demonstrates how version control concepts could be applied to manage and track 
the evolution of taxonomic hypotheses. The actual processes involved are much more 
complex, as described in Pyle's paper "An Introduction to Scientific Names of Organisms and 
the Taxon Concepts they Represent" (Pyle 2022). 




2. Interoperable Data Standards: Standards like Darwin Core and the Taxon Concept 
Schema (TCS) are necessary to harmonise species hypothesis data with other 
biodiversity data types, such as observation and occurrence data. Consistent 
standards enable smoother integration and reuse of taxonomic information across 
platforms. How a preliminary concept could be part of Darwin Core and other 
standards frameworks will need careful consideration. For instance, the 
“dwc:previousIdentifications” property in Darwin Core could store the reference to preliminary 
data. PSH, SH and TH could have their own data model and metadata, but this 
also needs global consensus. As new data and insights are being generated, 
standards and schemas are essential for usability in diverse contexts. While 
Darwin Core is widely used, TCS’s separation of Taxon Concepts from Taxon 
Names allows greater flexibility for mapping and resolving taxonomic data. TCS 
could possibly accommodate dynamic states such as "Preliminary" and "Final" as 
new insights emerge. It could also address provenance and attribution, akin to the 
Linnaean tradition of authorship, requiring each state to have a source 
("accordingTo") (Klazenga and Liljeblad 2024). 

3. FAIR Principles: Along with PIDs, machine-readable formats and data standards 
will enhance accessibility, interoperability and reusability, supporting transparent 
and evolving taxonomic classifications. Similar ideas have been proposed by 
Miralles et al. (2020) in the context of alpha taxonomy repositories. Taxonomic 
Data Objects (Upham and Poelen 2024) could standardise PSH data in a 
machine-readable format, preserving their structure and allowing flexible data 
use. 

4. Global Coordination and open-source practices: Collaborative efforts with 
established networks are essential for integrating PSH into a global biodiversity 
framework. Beyond achieving consensus on metadata standards, the accessibility 
and publication of these data must remain a priority. Funders, research institutions 
and collection-holding organisations need to recognise the importance of APIs 
(Addink et al. 2023), repositories, data stewardship (De Prins 2019, Bentley et al. 
2024) and other foundational infrastructure and commit both human and 
technological resources to support them. This is especially crucial given that 
many countries, despite their reliance on biodiversity data for modelling and 
monitoring, often lack the necessary capacity, expertise or funding to fully exploit 
its potential (Moersberger et al. 2024). As illustrated by New Zealand's example, 
where a small population and limited taxonomic expertise hinder the 
development of comprehensive taxonomic research, many countries depend on 
international collaboration for taxonomic knowledge. Addressing this taxonomic 
impediment calls for capacity building, knowledge exchange and the creation of 
sustainable, FAIR-aligned taxonomic services through coordinated efforts 
(Buckley 2024). A unified global solution may be impractical, yet stronger 
coordination in the software and standards that support taxonomic services is 
critical. This can facilitate the effective use of new data elements like PSH and 
promote shared governance structures. For instance, the discussions by Sandall 
et al. (2023) on checklist maintenance can be extended to taxonomic software 
and service development, where PSH could be tested and refined. Capacity 
management and funding challenges also require open dialogue, especially 
given the voluntary nature of many contributions in taxonomy and also in 
biodiversity informatics and data stewardship. Metrics from open-source projects, 
such as the "Contributor Absence Factor" (or "Bus Factor") - which assesses how 
many contributors can be lost before a project is impacted - could help guide 
efforts towards sustainability. By learning from open-source practices and 
research software sustainability principles (Cohen et al. 2021), we can enhance 
taxonomy's resilience and interoperability across regions. While taxonomic 
expertise remains indispensable, adopting insights from open-source and other 
data ecosystems will help us to overcome challenges in data infrastructure and 
interoperability. 
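
As a small illustration of the "Contributor Absence Factor" mentioned in the last item, the
sketch below uses one common simplification: the smallest number of contributors who together
account for at least half of all recorded contributions. The 50% cut-off, the function name
and the contribution counts are assumptions made for illustration; a real assessment of a
taxonomic service would also need to weigh expertise that is not captured in contribution
counts.

```python
from __future__ import annotations


def contributor_absence_factor(contributions: dict[str, int],
                               share: float = 0.5) -> int:
    """Smallest number of top contributors covering `share` of all contributions."""
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("at least one contribution is required")
    covered = 0
    # Walk through contributors from most to least active.
    for rank, count in enumerate(sorted(contributions.values(), reverse=True), start=1):
        covered += count
        if covered >= share * total:
            return rank
    return len(contributions)


# Hypothetical contribution counts (e.g. curated records or code commits).
concentrated = {"curator_a": 60, "curator_b": 25, "curator_c": 10, "curator_d": 5}
distributed = {"curator_a": 30, "curator_b": 30, "curator_c": 20, "curator_d": 20}

factor_concentrated = contributor_absence_factor(concentrated)
factor_distributed = contributor_absence_factor(distributed)
```

A low value, as in the first example, signals that a service depends heavily on very few
people, which is the kind of fragility the sustainability discussion above is concerned with.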


Conclusion 


Buckley's concept of PSH, primarily proposed within entomology, parallels existing 
frameworks like SH in mycology and BINs and OTUs from molecular methods. Despite 
their overlaps and distinctions, the need for standardised frameworks to manage 
preliminary and evolving taxonomic data remains crucial. These frameworks address 
challenges across diverse taxonomic domains, emphasising their potential to create 
interoperable and dynamic taxonomic practices, but a wider and global discussion is 
needed to find a holistic solution. 


In the context of New Zealand, Buckley advocates for shifting entomological taxonomy 
away from the primary focus on completing Linnaean classification. Instead, his proposal 
highlights achievable objectives aligned with realistic funding and timelines, 
incorporating DNA data and AI methods as preliminary steps towards formal 
classification. This commentary connects Buckley's proposal to broader initiatives, such 
as FAIR principles, Digital Extended Specimens, Taxon Concept Schema, Taxonomic 
Data Objects and open-source software practices. By treating PSH as data points - 
similar to versioned Git "commits" or "preprints" - species identifications and classifications 
can be iteratively refined without losing historical data. This fosters a more adaptable and 
integrative approach to taxonomy, bridging morphological and molecular data and AI-based 
identification, while enhancing global biodiversity conservation efforts. 
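The idea of treating each PSH update as a versioned "commit" can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a proposed standard: the class names `PSHCommit` and `PSHLog`, the field names and the example identifiers are all hypothetical. Each revision is immutable, points to its parent and gets a content-derived identifier, so earlier hypotheses stay retrievable after refinement.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class PSHCommit:
    """One immutable revision of a preliminary species hypothesis."""
    psh_id: str                   # hypothetical stable identifier for the PSH
    label: str                    # provisional or formal name at this revision
    evidence: Tuple[str, ...]     # identifiers of supporting data (assumed format)
    parent: Optional[str] = None  # commit_id of the previous revision, if any

    @property
    def commit_id(self) -> str:
        # Content-derived identifier: same content always yields the same id.
        payload = json.dumps(
            {"psh_id": self.psh_id, "label": self.label,
             "evidence": list(self.evidence), "parent": self.parent},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


class PSHLog:
    """Append-only store: revisions are added, never overwritten."""

    def __init__(self) -> None:
        self._commits: Dict[str, PSHCommit] = {}

    def commit(self, psh_id: str, label: str, evidence: Iterable[str],
               parent: Optional[str] = None) -> PSHCommit:
        if parent is not None and parent not in self._commits:
            raise KeyError(f"unknown parent commit: {parent}")
        record = PSHCommit(psh_id, label, tuple(evidence), parent)
        self._commits[record.commit_id] = record
        return record

    def history(self, commit_id: str) -> List[PSHCommit]:
        """Return revisions from `commit_id` back to the first one."""
        chain: List[PSHCommit] = []
        current: Optional[str] = commit_id
        while current is not None:
            record = self._commits[current]
            chain.append(record)
            current = record.parent
        return chain


# Hypothetical usage: a barcode-based hypothesis later refined with image data.
log = PSHLog()
first = log.commit("PSH-0001", "Coleoptera sp. A (provisional)", ["barcode:example-1"])
second = log.commit("PSH-0001", "Genus X sp. nov. (under review)",
                    ["barcode:example-1", "image:example-7"], parent=first.commit_id)
for rev in log.history(second.commit_id):
    print(rev.commit_id, rev.label)
```

Because revisions are never edited in place, a later formal description does not erase the provisional assertion it grew out of; it simply becomes the newest entry in the chain.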


Conflicts of interest 


The authors have declared that no competing interests exist. 


References 


Addink W, Kyriakopoulou N, Penev L, Fichtmueller D, Norton B, Shorthouse D (2023) 
Deliverable D1.3 Best practice manual for findability, re-use and accessibility of 
infrastructures. ARPHA Preprints https://doi.org/10.3897/arphapreprints.e107169 




Bánki O, Döring M, Jeppesen T (2023) Name IDs and Name Matching for Catalogue of 
Life: Existing Services and Prospects. Biodiversity Information Science and Standards 7 
https://doi.org/10.3897/biss.7.111662 

Bentley A, Thiers B, Moser WE, Watkins-Colwell GJ, Zimkus BM, Monfils AK, Franz NM, 
Bates JM, Boundy-Mills K, Lomas MW, Ellwood ER, Poo S, Contreras DL, Webster MS, 
Nelson G, Pandey JL (2024) Community Action: Planning for Specimen Management in 
Funding Proposals. BioScience 74 (7): 435-439. https://doi.org/10.1093/biosci/biae032 

Borgman C, Sands A, Darch P, Golshan M (2016) The durability and fragility of 
knowledge infrastructures: Lessons learned from astronomy. Proceedings of the 
Association for Information Science and Technology 53 (1): 1-10. 
https://doi.org/10.1002/pra2.2016.14505301057 

Borgman C, Wofford M (2021) From Data Processes to Data Products: Knowledge 
Infrastructures in Astronomy. Harvard Data Science Review 
https://doi.org/10.1162/99608f92.4e792052 

Borgman C, Brand A (2024) The Future of Data in Research Publishing: From Nice to 
Have to Need to Have? Harvard Data Science Review 
https://doi.org/10.1162/99608f92.b73aae77 

Buckley TR (2024) Charting a future for entomological taxonomy in New Zealand. New 
Zealand Entomologist 1-17. https://doi.org/10.1080/00779962.2024.2407230 

Cohen J, Katz D, Barker M, Chue Hong N, Haines R, Jay C (2021) The Four Pillars of 
Research Software Engineering. IEEE Software 38 (1): 97-105. 
https://doi.org/10.1109/ms.2020.2973362 

De Prins J (2019) Global Open Biodiversity Data: Future Vision of FAIR Biodiversity 
Data Access, Management, Use and Stewardship. Biodiversity Information Science and 
Standards 3 https://doi.org/10.3897/biss.3.37190 

Devictor V, Bensaude-Vincent B (2016) From ecological records to big data: the invention 
of global biodiversity. History and Philosophy of the Life Sciences 38 (4). https://doi.org/ 
10.1007/s40656-016-0113-2 

Favret C (2024) The 5 ‘D’s of Taxonomy: A User’s Guide. The Quarterly Review of 
Biology 99 (3): 131-156. https://doi.org/10.1086/732044 

Hardisty AR, Ellwood ER, Nelson G, Zimkus B, Buschbom J, Addink W, Rabeler RK, 
Bates J, Bentley A, Fortes JAB, Hansen S, Macklin JA, Mast AR, Miller JT, Monfils AK, 
Paul DL, Wallis E, Webster M (2022) Digital Extended Specimens: Enabling an 
Extensible Network of Biodiversity Data Records as Integrated Digital Objects on the 
Internet. BioScience 72 (10): 978-987. https://doi.org/10.1093/biosci/biac060 

Heberling JM, Miller J, Noesgaard D, Weingart S, Schigel D (2021) Data integration 
enables global biodiversity synthesis. Proceedings of the National Academy of Sciences 
118 (6). https://doi.org/10.1073/pnas.2018093118 

Huemer P, Mutanen M (2022) An Incomplete European Barcode Library Has a Strong 
Impact on the Identification Success of Lepidoptera from Greece. Diversity 14 (2). 
https://doi.org/10.3390/d14020118 

Islam S, Beach J, Ellwood E, Fortes J, Lannom L, Nelson G, Plale B (2023) Assessing 
the FAIR Digital Object Framework for Global Biodiversity Research. Research Ideas 
and Outcomes 9 https://doi.org/10.3897/rio.9.e108808 

Karbstein K, Kösters L, Hodač L, Hofmann M, Hörandl E, Tomasello S, Wagner N, 
Emerson B, Albach D, Scheu S, Bradler S, de Vries J, Irisarri I, Li H, Soltis P, Mäder P, 
Wäldchen J (2024) Species delimitation 4.0: integrative taxonomy meets artificial 
intelligence. Trends in Ecology & Evolution 39 (8): 771-784. 
https://doi.org/10.1016/j.tree.2023.11.002 

Klazenga N, Liljeblad J (2024) Expressing Circumscription in the Taxon Concept Schema 
(TCS). Biodiversity Information Science and Standards 8 
https://doi.org/10.3897/biss.8.140738 

Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AS, Bahram M, Bates S, Bruns 
T, Bengtsson-Palme J, Callaghan T, Douglas B, Drenkhan T, Eberhardt U, Duenas M, 
Grebenc T, Griffith G, Hartmann M, Kirk P, Kohout P, Larsson E, Lindahl B, Lücking R, 
Martin M, Matheny PB, Nguyen N, Niskanen T, Oja J, Peay K, Peintner U, Peterson M, 
Põldmaa K, Saag L, Saar I, Schuler A, Scott J, Senés C, Smith M, Suija A, Taylor DL, 
Telleria MT, Weiss M, Larsson K (2013) Towards a unified paradigm for sequence-based 
identification of fungi. Molecular Ecology 22 (21): 5271-5277. 
https://doi.org/10.1111/mec.12481 

Kõljalg U, Nilsson H, Schigel D, Tedersoo L, Larsson K, May T, Taylor AS, Jeppesen TS, 
Frøslev TG, Lindahl B, Põldmaa K, Saar I, Suija A, Savchenko A, Yatsiuk I, Adojaan K, 
Ivanov F, Piirmann T, Pöhönen R, Zirk A, Abarenkov K (2020) The Taxon Hypothesis 
Paradigm—On the Unambiguous Detection and Communication of Taxa. Microorganisms 
8 (12). https://doi.org/10.3390/microorganisms8121910 

Leonelli S (2020) Learning from Data Journeys. Data Journeys in the Sciences 1-24. 
https://doi.org/10.1007/978-3-030-37177-7_1 

Lue C, Abram P, Hrcek J, Buffington M, Staniczenko PA (2022) Metabarcoding and 
applied ecology with hyperdiverse organisms: Recommendations for biological control 
research. Molecular Ecology 32 (23): 6461-6473. https://doi.org/10.1111/mec.16677 

Meier R, Lawniczak MN, Srivathsan A (2024) Illuminating Entomological Dark Matter with 
DNA Barcodes in an Era of Insect Decline, Deep Learning, and Genomics. Annual 
Review of Entomology https://doi.org/10.1146/annurev-ento-040124-014001 

Miralles A, Bruy T, Wolcott K, Scherz MD, Begerow D, Beszteri B, Bonkowski M, Felden 
J, Gemeinholzer B, Glaw F, Glöckner FO, Hawlitschek O, Kostadinov I, Nattkemper TW, 
Printzen C, Renz J, Rybalka N, Stadler M, Weibulat T, Wilke T, Renner SS, Vences M 
(2020) Repositories for Taxonomic Data: Where We Are and What is Missing. Systematic 
Biology 69 (6): 1231-1253. https://doi.org/10.1093/sysbio/syaa026 

Moersberger H, Valdez J, Martin JC, Junker J, Georgieva I, Bauer S, Beja P, Breeze T, 
Fernandez M, Fernandez N, Brotons L, Jandt U, Bruelheide H, Kissling WD, Langer C, 
Liquete C, Lumbierres M, Solheim AL, Maes J, Morán-Ordóñez A, Moreira F, Pe'er G, 
Santana J, Shamoun-Baranes J, Smets B, Capinha C, McCallum I, Pereira H, Bonn A 
(2024) Biodiversity monitoring in Europe: User and policy needs. Conservation Letters 17 
(5). https://doi.org/10.1111/conl.13038 

Page RM (2019) Ozymandias: a biodiversity knowledge graph. PeerJ 7 
https://doi.org/10.7717/peerj.6739 

Penev L, Koureas D, Groom Q, Lanfear J, Agosti D, Casino A, Miller J, Cochrane G, 
Bánki O, Kõljalg U, Ruch P (2024) Beyond BiCIKL: Towards Building an AI-Assisted 
"Biodiversity Supergraph". Biodiversity Information Science and Standards 8: 135550. 
https://doi.org/10.3897/biss.8.135550 

Pyle R (2022) An Introduction to Scientific Names of Organisms, and the Taxon Concepts 
they Represent. Biodiversity Information Science and Standards 6 
https://doi.org/10.3897/biss.6.93926 




Ratnasingham S, Hebert PN (2013) A DNA-Based Registry for All Animal Species: The 
Barcode Index Number (BIN) System. PLoS ONE 8 (7). 
https://doi.org/10.1371/journal.pone.0066213 

Sandall E, Maureaud A, Guralnick R, McGeoch M, Sica Y, Rogan M, Booher D, Edwards 
R, Franz N, Ingenloff K, Lucas M, Marsh C, McGowan J, Pinkert S, Ranipeta A, Uetz P, 
Wieczorek J, Jetz W (2023) A globally integrated structure of taxonomy to support 
biodiversity science and conservation. Trends in Ecology & Evolution 38 (12): 
1143-1153. https://doi.org/10.1016/j.tree.2023.08.004 

Sterner B, Gilbert E, Franz N (2020) Decentralized but Globally Coordinated Biodiversity 
Data. Frontiers in Big Data 3 https://doi.org/10.3389/fdata.2020.519133 

Sterner B, Elliott S, Gilbert EE, Franz NM (2023) Unified and pluralistic ideals for data 
sharing and reuse in biodiversity. Database 2023 
https://doi.org/10.1093/database/baad048 

Upham N, Poelen J (2024) Taxonomic Data Objects for Communicating the Meaning of 
Species Names. Biodiversity Information Science and Standards 8 
https://doi.org/10.3897/biss.8.139413 

Verma A, Detsky A (2020) Preprints: a Timely Counterbalance for Big Data-Driven 
Research. Journal of General Internal Medicine 35 (7): 2179-2181. 
https://doi.org/10.1007/s11606-020-05746-w 

Wäldchen J, Mäder P (2018) Machine learning for image based species identification. 
Methods in Ecology and Evolution 9 (11): 2216-2225. 
https://doi.org/10.1111/2041-210x.13075 

Yang B, Zhang Z, Yang C, Wang Y, Orr MC, Wang H, Zhang A (2021) Identification of 
Species by Combining Molecular and Morphological Data Using Convolutional Neural 
Networks. Systematic Biology 71 (3): 690-705. https://doi.org/10.1093/sysbio/syab076